I provide two minimal examples to compare discrete and continuous smoothing to figure out the differences. The discrete will be performed by fitting a Beta regression using a GAM with Gaussian Markov Random Fields (GMRF) as basis. The MRF are the discrete counterpart of Gaussian processes (GP).
The GMRF smoothing requires a mesh grid (the graph/network/lattice call it as you like) to build and estimate the response based on the Markov blanket.
The GP smoothing takes the coordinates as input (so continuous input) to estimate the response using Gaussian processes.
The residuals are the geomtric ones: \(\mathrm{res} = \mathrm{freq}_{\mathrm{obs}}/\mathrm{freq}_{\mathrm{glm}}\) meaning that we really consider the Poisson distribution as valid. If \(\mathrm{res} <1\) the model overestimates the claim frequency and vice versa.
Summarizing what I’ve done in this notebook:
Moreover, I did the same pipeline on the PC3CODE and PC2CODE levels of coarse graining. I’ve done: - Use the aggregated version of the raw GLM residuals since I only have the raw residuals at the PC4CODE level. So it’s not 100% valid but it provides intuition of what would be the results. - Merge the polygons and the shape files at those two levels - Re-fit the GAM with GMRF basis - Plot the results, smoothed and raw residuals on the training set
The run is pretty fast, the entire file compile in +-15 min (can be re-written avoiding doing multiple times the same operation, will be then much faster). The most expensive part being the plotting chunks. The results (at least for the smoothing part) should be quite similar to the ICAR. The most discriminative part should be the clustering part since quantiles and hierarchical clustering are pretty different. The hierarchical clustering has the advantage to group similar areas (in terms of smoothed residuals) pairwise and build the structure. This means that zone 1 is more similar to zone 2 than to zone 3 (similarity order).
The quantiles will provide an uniform distribution among zones but two areas within a single zone might therefore be significantly different.
At the first stage I didn’t have a val set (only train and test already aggregated at the PC4 level). Ideally, the roughness parameter (hyper param of the GAM+GMRF) should be chosen by evalutation on the validation set. I set it by hand (as a guess value).
## OGR data source with driver: ESRI Shapefile
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
##
## Family: Beta regression(1.804)
## Link function: logit
##
## Formula:
## geom_res_freq ~ s(PC4CODE, bs = "mrf", k = roughness_param, xt = list(nb = nb))
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2372 0.0171 -130.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(PC4CODE) 123.5 143.2 2474 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = -0.191 Deviance explained = -25.9%
## -REML = -22254 Scale est. = 1 n = 3952
However since we are speaking of residuals, I highly recommend a diverging colour palette. The differentiation between less or greater than 1 will be easier and the results better illustrated. Here is the rendering (the 0 are coded in red):
Please, forget even the existence of the jet colour map. They are so many papers on this but a picture is worth a thousand words:
One cannot distinguish anything using the jet colour map. Information that we hardly extracted from the data are hidden and non interpretable (no boundaries between positive and negative, cluster are fuzzy, non uniform and misleading colour intensity, etc.). If for an obscure reason, one really really wants a rainbow like colour map, at least one should use a uniformized rainbow map like the one provided by https://peterkovesi.com/projects/colourmaps/
## OGR data source with driver: ESRI Shapefile
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
##
## Family: gaussian
## Link function: identity
##
## Formula:
## geom_res_freq ~ s(XCOORD, YCOORD, bs = "gp", k = roughness_param,
## m = c(1, 0.175))
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.89451 0.01177 75.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(XCOORD,YCOORD) 2 2 45.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0219 Deviance explained = 2.24%
## GCV = 0.54832 Scale est. = 0.54791 n = 3952
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
The thing is that GP are flexible but quite smooth, too smooth for the present application, besides continuous instead of discrete.
Note that clustering the smoothed residuals, meaning \(\mathrm{res}_\mathrm{smooth}\) instead of \(\mathrm{res}_\mathrm{gml}\), will return a much more homogeneous spatial distribution.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95 | 206 | 243 | 205 | 107 | 178 | 266 | 69 | 90 | 122 | 312 | 202 | 230 | 221 |
| 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|
| 60 | 77 | 244 | 237 | 393 | 395 |
Probably too many zones.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| 292 | 449 | 383 | 468 | 206 | 122 | 312 | 451 | 244 | 1025 |
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
The zones are illustrated hereafter in qualitative colours. The zones are the levels of a categorical variable, the way this variable was built discretizing the smoothed residuals and ordered the cluster (hierarchy). Therefore it also makes sense to illustrate the zones in sequential colours. I provide this illustration below the qualitative colours chart.
The zones in sequential colours
As Allsecur did 30 zones, let’s re-cut the hierarchical tree and plot the zones for comparison.
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
and comparing the zones. Using the hierarchical clustering, several zones will be almost empty
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95 | 125 | 124 | 205 | 81 | 107 | 178 | 153 | 113 | 119 | 10 | 90 | 64 | 176 |
| 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 103 | 136 | 230 | 106 | 99 | 58 | 60 | 77 | 59 | 115 | 92 | 152 | 237 | 311 | 395 | 82 |
In the SAS procedure, the distribution is
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 3 | 5 | 4 | 3 | 3 | 5 | 13 | 25 | 73 | 148 | 234 | 237 | 246 | 289 | 281 |
| 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 293 | 391 | 478 | 473 | 316 | 177 | 123 | 89 | 53 | 36 | 17 | 10 | 4 | 9 |
In qualitative colours:
Note that since NAs have been encountered in the GMRF fitting (no observation for some zip code in the training set), one cannot predict a value for those zip code. Indeed they have been removed before the lattice setting because the lattice cannot be defined based on NAs. If you want to be able to predict for all zip code, make imputation or be sure to have at least one observation per zip code or reduce the granularity.
## OGR data source with driver: ESRI Shapefile
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
Let’s plot the prediction and the target values
Just to given an idea, I provide here the equivalent at a more coarse grained level: the PC3 zip code. To do that, I’ll take the aggregation of the provided GLM ncl just to avoid to re-predict the Emblem model at the PC2 level. If the results are satisfactory.
First let’s visualize the difference between the two levels of granularity PC4 and PC3. In the following maps, the fill is done using the PC4 or PC3 code. No quantitative meaning, just differentiation of the areas.
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## OGR data source with driver: ESRI Shapefile
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
and the exposure at the different levels:
## OGR data source with driver: ESRI Shapefile
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
##
## Family: Beta regression(12.594)
## Link function: logit
##
## Formula:
## geom_res_freq ~ s(PC3CODE, bs = "mrf", k = roughness_param, xt = list(nb = nb))
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.65970 0.02455 -67.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(PC3CODE) 70.52 88.53 563.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = -0.475 Deviance explained = 65.9%
## -REML = -958.91 Scale est. = 1 n = 798
fwrite(zones_gmrf[, .(PC4CODE, PC3CODE, PC2CODE, PC1CODE, zone30, zone20, zone10)], “PC4zones_gamgmrf.csv”)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 50 | 100 | 39 | 71 | 50 | 64 | 23 | 49 | 34 | 31 | 31 | 19 | 4 | 17 | 17 |
| 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 12 | 4 | 14 | 13 | 16 | 7 | 29 | 7 | 10 | 19 | 16 | 5 | 1 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 89 | 100 | 102 | 50 | 64 | 57 | 49 | 44 | 19 | 5 | 24 | 17 | 16 | 14 |
| 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|
| 32 | 26 | 7 | 29 | 21 |
Probably too many zones.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| 57 | 189 | 216 | 106 | 61 | 24 | 30 | 39 | 26 | 50 |
The zones in qualitative colours
The zones in sequential colours